Abstract. This paper addresses the large-scale acquisition of end-to-end
network performance. We make two distinct contributions: ordinal rating of
network performance and inference by matrix completion. The former reduces
measurement costs and unifies various metrics, which eases their processing
in applications. The latter enables scalable and accurate inference without
requiring structural information about the network or geometric constraints.
Combining the two makes the acquisition problem closely resemble that of
recommender systems. This paper therefore investigates the applicability of
various matrix factorization models used in recommender systems. We found
that simple regularized matrix factorization is not only practical but also
produces accurate results that are beneficial for peer selection.

1 Introduction 

The knowledge of end-to-end network performance is beneficial to many
Internet applications [1]. Acquiring such knowledge poses two main challenges.
First, the performance of a network path can be characterized by various metrics
that differ widely. On the one hand, the wide variety of these metrics makes
them difficult to process in applications. On the other hand, although it has
been studied for decades, network measurement for many metrics still suffers
from high cost and low accuracy. Second, it is critical to monitor the
performance of the entire network efficiently. As the number of network paths
grows quadratically with the number of network nodes, active probing of all
paths on large networks is clearly infeasible.

In this paper, we address these challenges by two distinct contributions: or- 
dinal rating of network performance and inference by matrix completion. 

Ordinal Rating of Network Performance. Instead of quantifying the
performance of a network path by the exact value of some metric, we investigate
the rating of network performance by ordinal numbers 1, 2, 3, ..., with larger
values indicating better performance, regardless of the metric used. Ordinal
ratings have the following advantages over exact metric values.
- Ratings carry sufficient information to fulfill the requirements of
  many Internet applications. For example, streaming media cares mainly about
  whether the available bandwidth of a path is high enough to provide smooth
  playback quality. In peer-to-peer applications, although finding the nearest
  nodes to communicate with is preferable, it is often enough to access nearby
  nodes with limited loss compared to the nearest ones. Such an objective of
  finding "good-enough" paths can be well served using the rating information.

- Ratings are coarse measures that are cheaper to obtain. They are also stable
  and better reflect long-term characteristics of network paths, which means
  that they can be probed less often.

- The representation by ordinal numbers not only allows the rating information
  to be encoded in a few bits, saving storage and transmission costs, but also
  unifies various metrics and eases their processing in applications.

Inference by Matrix Completion. We then address the scalability issue by 
network inference that measures a few paths and predicts the performance of 
the other paths where no direct measurements are made. In particular, we for- 
mulate the inference problem as matrix completion where a partially observed 
matrix is to be completed [2]. Here, the matrix contains performance measures 
between network nodes with some of them known and the others unknown and 
thus to be filled. Compared to previous approaches [3,4,5], our matrix
completion formulation relies on neither structural information of the network
nor geometric constraints. Instead, it exploits the spatial correlations across
network measurements, which have long been observed in various studies [4,5].

Integrating the ordinal rating of network performance with the matrix
completion formulation has a major benefit: the acquisition problem closely
resembles that of recommender systems, which have been well studied in
machine learning. Thus, a particular focus of this paper is to investigate the
applicability of various recommender system solutions to our network inference
problem. In particular, we found that simple regularized matrix factorization
is not only practical but also produces accurate results that are beneficial
for peer selection.

Previous work on network inference focused on predicting the metric values
of network paths. For example, [4,5] solved the inference problem by using the
routing table of the network. In contrast, Vivaldi [5] and DMFSGD [6] built
their inference models without using network topology information, based on
Euclidean embedding and on matrix completion, respectively. The same DMFSGD
algorithm was adapted in [7] to classify network performance into binary
classes of either "good" or "bad". Building on [7], this paper goes further and
studies ordinal ratings of network performance and their inference by solutions
to recommender systems.

The rest of the paper is organized as follows. Section 2 describes the metrics
and the rating of network performance. Section 3 introduces network inference
by matrix completion. Section 4 presents the experimental results and the
application to peer selection. Section 5 gives conclusions.
2 Network Performance

2.1 Metrics 

End-to-end network performance is a key concept at the heart of networking [1].
Numerous metrics have been designed to serve various objectives. For example,
delay-related metrics measure the response time between network nodes and are
of interest to downloading services. Bandwidth-related metrics indicate the data
transmission rate over network paths and matter to online streaming. Great
efforts have been made to acquire these metrics, leading to various
measurement tools. However, the acquisition of some metrics still suffers from
high cost and low accuracy. For example, measuring the available bandwidth
of a path often requires congesting the path being probed.

In this paper, we focus on two commonly-used performance metrics, namely 
round-trip time (RTT) and available bandwidth (ABW). 

2.2 Ordinal Rating 

Acquiring end-to-end network performance amounts to determining some
quantity of a chosen metric. As mentioned earlier, ratings have several
advantages over exact values. In addition, ratings better reflect end users'
experience of the Quality of Service (QoS), by which network performance should
be defined.

Ratings take ordinal values in the range [1, R], where R = 5 in this
paper. The different rating levels indicate qualitatively how well network paths
would perform, i.e., 1-"very poor", 2-"poor", 3-"ordinary", 4-"good" and
5-"very good".

Generally, ratings can be acquired by vector quantization that partitions the
range of the metric into R bins using R - 1 thresholds,
$\tau = \{\tau_1, \ldots, \tau_{R-1}\}$, and determines to which bin each metric
value belongs, as illustrated in Figure 1. The thresholds can be chosen evenly
or unevenly according to the requirements of the applications. Clearly, rating a
path is cheaper than measuring the exact value, as we only need to determine
whether the value is within a certain range defined by the thresholds. This
holds for most, if not all, metrics, since data acquisition generally faces an
accuracy-versus-cost dilemma: accuracy always comes at a cost. The cost
reduction is particularly significant for ABW.
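
As a minimal sketch of the thresholding step described above, the following Python snippet maps metric values to ratings in [1, R]. The function name `rate` and the `higher_is_better` flag are illustrative assumptions and do not come from the paper; the example thresholds are the evenly spaced RTT values reported for Meridian in Section 4.1.

```python
import numpy as np

def rate(values, thresholds, higher_is_better):
    """Map metric values to ordinal ratings in [1, R], with R = len(thresholds) + 1."""
    thresholds = np.asarray(thresholds, dtype=float)
    # digitize counts how many thresholds are <= each value, giving a bin in [0, R-1]
    bins = np.digitize(values, thresholds)
    R = len(thresholds) + 1
    # Larger ratings mean better performance: ABW keeps the bin order,
    # while RTT (lower is better) reverses it.
    return bins + 1 if higher_is_better else R - bins

rtt_ms = np.array([12.0, 30.0, 60.0, 80.0, 250.0])
ratings = rate(rtt_ms, [25, 50, 75, 100], higher_is_better=False)
```

Here an RTT below 25 ms is rated 5 and one above 100 ms is rated 1. Only a comparison against the thresholds is needed, which is why a rating can be cheaper to obtain than an exact value.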

2.3 Intelligent Peer Selection 

For many Internet applications, the goal of acquiring end-to-end network
performance is to achieve the QoS objectives for end users. Examples include
choosing low-delay peers to communicate with in overlay networks or choosing a
high-bandwidth mirror site from which to stream media. In these examples,
intelligent peer selection is desired to optimize services by finding for each
node a peer that is likely to respond fast and well.

Fig. 1. Examples of quantification of metric values into ratings of [1,5].

The question is whether, to achieve this goal, we should use metric values or
ratings of network paths. On the one hand, knowing the metric values allows
finding the globally optimal node over the entire network. On the other hand,
although the rating information only allows finding a "good-enough" node,
ratings are cheaper to obtain. Thus, it is interesting to compare the optimality
of peer selection based on ratings with that based on metric values.

3 Inference by Matrix Completion 
3.1 Fundamentals of Matrix Completion 

Matrix completion addresses the problem of recovering a low-rank matrix from
a subset of its entries. In practice, a real data matrix $X$ is often full rank
but has only $r$ dominant components. That is, $X$ has only $r$ significant
singular values and the others are negligible. In such a case, a matrix of rank
$r$, denoted by $\hat{X}$, can be found that approximates $X$ with high
accuracy [2].

Generally, matrix completion is solved by the low-rank approximation,

$$P_\Omega(\hat{X}) \approx P_\Omega(X) \quad \text{s.t.} \quad \mathrm{Rank}(\hat{X}) \le r. \qquad (1)$$

$\Omega$ is the set of observed entries and $P_\Omega$ is a sampling function
that preserves the entries in $\Omega$ and turns the others into 0. In words,
we try to find a low-rank matrix $\hat{X}$ that best approximates $X$ for the
observed entries.

The rank function is difficult to optimize or constrain, since it is neither
convex nor continuous. Alternatively, the low-rank constraint can be tackled
directly by adopting some compact representation. For example, as
$\mathrm{Rank}(\hat{X}) \le r$,

$$\hat{X} = U V^T, \qquad (2)$$

where $U$ and $V$ are matrices of $n \times r$. Thus, we can look for the pair
$(U, V)$, instead of $\hat{X}$, such that

$$P_\Omega(U V^T) \approx P_\Omega(X). \qquad (3)$$

This approach is called matrix factorization (MF). As the pair $(U, V)$ has
$2nr$ entries in contrast to the $n^2$ for $\hat{X}$, matrix factorization is
much more appealing for large matrices.

3.2 Network Inference

The network performance inference is formulated as a matrix completion problem.
In this context, $X$ is an $n \times n$ matrix constructed from a network of $n$
nodes. The entry $x_{ij}$ is some performance measure, a rating in our case, of
the path from node $i$ to node $j$. $X$ is largely incomplete as many paths are
unmeasured.

Fig. 2. The singular values of an RTT matrix and an ABW matrix, and of the
corresponding rating matrices. The singular values are normalized by the largest one.

Network inference is feasible because network measurements are correlated
across paths. The correlations come largely from link sharing between network
paths [4,5], due to the simple topology of the Internet core, where network
paths overlap heavily. These correlations cause the related performance matrix
to be approximately low-rank and enable the inference problem to be solved
by matrix completion. We empirically evaluate the low-rank characteristics of an
RTT and an ABW matrix by the spectral plots in Figure 2. It can be seen that the
singular values of both the original matrices and the related rating matrices
decrease fast. This low-rank phenomenon has been consistently observed in many
studies [4,5,6,7].
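
The kind of spectral check described above can be sketched in a few lines of Python. The matrix below is synthetic (a rank-5 product plus small noise) and merely stands in for a measured performance matrix; its size, rank and noise level are arbitrary assumptions and are not taken from the datasets used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 5
# Synthetic stand-in for a performance matrix: rank-r structure plus small noise.
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, n)) + 0.01 * rng.normal(size=(n, n))

s = np.linalg.svd(X, compute_uv=False)   # singular values, in descending order
s_norm = s / s[0]                        # normalize by the largest one

# Fraction of the squared spectrum captured by the top-r components.
energy = np.sum(s[:r] ** 2) / np.sum(s ** 2)
```

Plotting `s_norm` against its index gives a spectral plot of the same kind as the one described for Figure 2: a sharp drop after the first few singular values indicates that a low-rank approximation loses little information.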

As performance measures are ratings, the inference problem bears strong 
similarities to recommender systems which predict the rating that a user would 
give to an item such as music, books, or movies [5]. In this context, network 
nodes are users and they treat other nodes as items. In a sense, a rating is a 
preference measure of how a node would like to contact another node. 

A major driver of recommender systems research was the Netflix Prize, which was
awarded to the BellKor's Pragmatic Chaos team in 2009 [5]. In the sequel, the
prize-winning solution is called BPC. BPC integrated two classes of techniques,
based on neighborhood models and on matrix factorization. Neighborhood models
exploit the similarities between users and between items. Calculating
similarities requires a sufficient number of ratings, which may not be available
in our problem. Thus, we focus in this paper on the applicability of matrix
factorization and leave the study of neighborhood models as future work.

3.3 Matrix Factorization 

The goal of MF is to find $U$ and $V$ such that $U V^T$ is close to $X$ at the
observed entries in $\Omega$. This section discusses various MF models that were
integrated in BPC, including RMF, MMMF and NMF [5].

RMF Regularized matrix factorization (RMF) [8] adopts the widely-used $L_2$
loss function and solves

$$\min \sum_{ij \in \Omega} (x_{ij} - u_i v_j^T)^2 + \lambda \sum_{i=1}^{n} (u_i u_i^T + v_i v_i^T), \qquad (4)$$

where $u_i$ and $v_i$ are row vectors of $U$ and $V$ and $x_{ij}$ is the $ij$-th
entry of $X$. The second term is the regularization, which restricts the norm of
$U$ and $V$ so as to prevent overfitting. $\lambda$ is the regularization
coefficient.

The unknown entries in $X$ are predicted by

$$\hat{x}_{ij} = u_i v_j^T, \quad \text{for } ij \notin \Omega. \qquad (5)$$

Note that $\hat{x}_{ij}$ is real-valued and has to be rounded to the closest
integer in the range of $[1, R]$.
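
The prediction and rounding step of eq. 5 can be sketched as follows. The function name `predict_ratings` is a hypothetical helper, and the factors in the example are toy values chosen only to show the rounding and clipping behavior.

```python
import numpy as np

def predict_ratings(U, V, R=5):
    """Predict ratings from factors U, V (each n x r) as round(U V^T), kept in [1, R]."""
    X_hat = U @ V.T                                   # real-valued estimates, eq. (5)
    return np.clip(np.rint(X_hat), 1, R).astype(int)  # nearest integer, then clip to [1, R]

U = np.array([[1.0, 0.0], [0.0, 2.0]])
V = np.array([[2.4, 0.0], [0.0, 3.5]])
ratings = predict_ratings(U, V)
```

Note that `np.rint` rounds exact halves to the nearest even integer; how ties are broken is not specified in the text and is an assumption of this sketch.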

MMMF Max-margin matrix factorization (MMMF) solves the inference problem by
ordinal regression [10]. As in RMF, the unknown entries in $X$ are predicted by
eq. 5. However, instead of rounding, the real-valued estimate $\hat{x}_{ij}$ is
related to the ordinal rating $x_{ij}$ by using $R - 1$ thresholds
$\theta_1, \ldots, \theta_{R-1}$ and requiring

$$\theta_{x_{ij}-1} < \hat{x}_{ij} = u_i v_j^T < \theta_{x_{ij}}, \qquad (6)$$

where for simplicity of notation $\theta_0 = -\infty$ and $\theta_R = \infty$.
In words, the value of $\hat{x}_{ij}$ does not matter, as long as it falls in
the range of $[\theta_{r-1}, \theta_r]$ for $x_{ij} = r$.

In practice, it is impossible to have eq. 6 satisfied for every $\hat{x}_{ij}$.
Thus, we penalize the violation of the constraint and solve

$$\min \sum_{ij \in \Omega} \sum_{r=1}^{R-1} l\big(T_{ij}^{r} (\theta_r - u_i v_j^T)\big) + \lambda \sum_{i=1}^{n} (u_i u_i^T + v_i v_i^T), \qquad (7)$$

where $T_{ij}^{r} = 1$ if $x_{ij} \le r$ and $-1$ otherwise. Essentially, eq. 7
consists of a number of binary classifications, each of which compares an
estimate $\hat{x}_{ij}$ with a threshold. The loss function $l$ can be any
classification loss function; here, the smooth hinge loss function is used.
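
To make the per-entry loss of eq. 7 concrete, the sketch below implements the smooth hinge loss in its usual piecewise form, the sum of binary losses for one observed rating, and the decoding of an estimate into a rating via eq. 6. The function names and the fixed thresholds are illustrative assumptions; how the thresholds are obtained is not detailed in the text.

```python
import numpy as np

def smooth_hinge(z):
    """Smooth hinge: 0.5 - z for z <= 0, 0.5 * (1 - z)^2 for 0 < z < 1, else 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, 0.5 - z, np.where(z < 1, 0.5 * (1 - z) ** 2, 0.0))

def mmmf_entry_loss(x_ij, x_hat, theta):
    """Sum of the R-1 binary losses in eq. (7) for one observed rating x_ij."""
    r = np.arange(1, len(theta) + 1)          # threshold indices 1..R-1
    T = np.where(x_ij <= r, 1.0, -1.0)        # +1 if the estimate should lie below theta_r
    return float(np.sum(smooth_hinge(T * (np.asarray(theta) - x_hat))))

def decode(x_hat, theta):
    """Rating implied by eq. (6): 1 plus the number of thresholds below the estimate."""
    return int(np.sum(x_hat > np.asarray(theta))) + 1

theta = [1.5, 2.5, 3.5, 4.5]   # hypothetical thresholds, for illustration only
loss_close = mmmf_entry_loss(3, 3.0, theta)   # estimate inside the correct interval
loss_far = mmmf_entry_loss(3, 6.0, theta)     # estimate far above the correct interval
```

An estimate well inside the interval of the true rating incurs only a small margin penalty, while an estimate that crosses several thresholds is penalized once per violated threshold.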

NMF Non-negative matrix factorization (NMF) [11] incorporates an additional
constraint that all entries in $U$ and $V$ have to be non-negative so as to
ensure the non-negativity of $\hat{X}$.

Besides, NMF often uses the divergence to measure the difference between
$X$ and $\hat{X}$, defined as

$$D(X \| \hat{X}) = \sum_{ij \in \Omega} \Big( x_{ij} \log \frac{x_{ij}}{\hat{x}_{ij}} - x_{ij} + \hat{x}_{ij} \Big). \qquad (8)$$

Thus, NMF solves

$$\min D(X \| U V^T) + \lambda \sum_{i=1}^{n} (u_i u_i^T + v_i v_i^T) \quad \text{s.t.} \quad U \ge 0,\ V \ge 0.$$

As in RMF, $\hat{x}_{ij}$ is real-valued and has to be rounded to the closest
integer in the range of $[1, R]$.
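
A minimal sketch of the divergence in eq. 8, restricted to observed entries, is given below. The function name `divergence`, the boolean `mask` encoding of the observed set and the small `eps` guard are assumptions made for illustration.

```python
import numpy as np

def divergence(X, X_hat, mask, eps=1e-12):
    """Divergence of eq. (8), summed over the observed entries selected by mask."""
    x = X[mask]
    xh = np.maximum(X_hat[mask], eps)    # eps guards against division by zero
    return float(np.sum(x * np.log(x / xh) - x + xh))

X = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.ones_like(X, dtype=bool)
d_same = divergence(X, X, mask)          # identical matrices
d_off = divergence(X, X + 1.0, mask)     # every estimate off by one
```

Since ratings are at least 1, the logarithm is always well defined for the observed entries; the divergence is zero when the estimates match the ratings and positive otherwise.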

MF ENSEMBLES The success of BPC was built on the idea of the ensemble, which
learns multiple models and combines their outputs for prediction [9]. In
machine learning, several different models often give similar accuracy on
the training data but perform unevenly on unseen data. In this case, a simple
vote or average of the outputs of these models can reduce the variance of the
predictions. In this paper, we combine the above RMF, MMMF and NMF. The
final prediction is the average of the predictions by the different MF models.
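
The averaging step can be sketched as follows. Whether the average is taken over real-valued estimates or over already rounded ratings is not stated in the text; this sketch assumes real-valued estimates are averaged and then rounded, and the numbers are toy values.

```python
import numpy as np

def ensemble_predict(estimates, R=5):
    """Average real-valued estimates from several models, then round into [1, R]."""
    avg = np.mean(np.stack(estimates), axis=0)
    return np.clip(np.rint(avg), 1, R).astype(int)

# Toy estimates for three paths from three hypothetical models.
rmf = np.array([4.4, 1.2, 5.6])
mmmf = np.array([3.8, 2.3, 4.9])
nmf = np.array([4.1, 1.6, 5.3])
ratings = ensemble_predict([rmf, mmmf, nmf])
```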
3.4 Implementation Details

Inference By Stochastic Gradient Descent In BPC, different MF models
are solved by different optimization schemes, some of which are not appropriate
for network applications where decentralized processing of data is necessary. In
this paper, we adopted Stochastic Gradient Descent (SGD) for all MF models.
In short, at each iteration, we pick $x_{ij}$ in $\Omega$ randomly and update
$u_i$ and $v_j$ by gradient descent to reduce the difference between $x_{ij}$
and $u_i v_j^T$. SGD is suitable for network inference, because measurements can
be acquired on demand and processed locally at each node. We refer interested
readers to [5] for the details of the decentralized inference by SGD.
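
The update rule described above can be sketched as follows for RMF. This is a centralized, single-process illustration and not the decentralized implementation referred to in the text. The function name `rmf_sgd`, the uniform initialization, the per-sample application of the regularizer and the number of epochs are assumptions; the defaults for the learning rate, the regularization coefficient and the rank are the values reported in Section 4.2.

```python
import numpy as np

def rmf_sgd(entries, n, r=10, lr=0.05, lam=0.1, epochs=50, seed=0):
    """Fit U, V (n x r) to observed (i, j, x_ij) triples with the RMF loss of eq. (4)."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(0, 1, size=(n, r))
    V = rng.uniform(0, 1, size=(n, r))
    entries = list(entries)
    for _ in range(epochs):
        for k in rng.permutation(len(entries)):   # visit observed entries in random order
            i, j, x = entries[k]
            err = x - U[i] @ V[j]                 # residual x_ij - u_i v_j^T
            u_old = U[i].copy()
            # Gradient steps on (x_ij - u_i v_j^T)^2 + lam * (|u_i|^2 + |v_j|^2),
            # with the constant factor 2 absorbed into the learning rate.
            U[i] += lr * (err * V[j] - lam * U[i])
            V[j] += lr * (err * u_old - lam * V[j])
    return U, V
```

Each update touches only the row of node i and the row of node j, which is what makes the scheme amenable to local processing at the nodes.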

Neighbor Selection In recommender systems, users rate items voluntarily.
Network inference is different in that we do have control over data acquisition,
i.e., network nodes can actively choose to measure some of the paths connected
to them. Thus, we adopt the system architecture in [3,6,7], in which each node
randomly selects k nodes to probe, called neighbors in the sequel.

Thus, each node collects k ratings from the paths connecting it to its neighbors
and infers the other, unmeasured paths. The value of k has to be chosen by
trading off accuracy against overhead. On the one hand, increasing k always
improves accuracy, as we measure more and infer less. On the other hand, the
more we measure, the higher the overhead is. Thus, as in [7], we set k = 32 for
networks of a few thousand nodes and k = 10 for a few hundred nodes, leading to
about 1-5% available measurements.
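
The relation between k and the share of measured paths can be checked with a short sketch. The function name `select_neighbors` and the boolean mask representation are assumptions; the (n, k) pairs correspond to the Meridian and Harvard settings used in Section 4.

```python
import numpy as np

def select_neighbors(n, k, seed=0):
    """Boolean n x n mask: mask[i, j] is True if node i probes node j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        others = np.delete(np.arange(n), i)           # a node never probes itself
        mask[i, rng.choice(others, size=k, replace=False)] = True
    return mask

for n, k in [(2500, 32), (226, 10)]:
    mask = select_neighbors(n, k)
    density = mask.sum() / (n * (n - 1))              # equals k / (n - 1)
    print(f"n={n}, k={k}: {100 * density:.2f}% of paths measured")
```

Since every node measures exactly k of its n - 1 outgoing paths, the density is k / (n - 1): about 1.28% for n = 2500, k = 32 and about 4.44% for n = 226, k = 10, consistent with the 1-5% range quoted above.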

Rank r The most important parameter in MF is the rank r. If a given X,
constructed from a network, is complete, the proper rank can be studied by
the spectral plot as in Figure 2 under a given accuracy requirement. When X is
incomplete, we can only search for the optimal r empirically. On the one hand,
r has to be large enough so that enough information in X is kept. On the other
hand, a higher-rank matrix has less redundancy and requires more data to
recover, increasing measurement overhead. Thus, we choose a small value of
r = 10, as we have only a limited number of measurements.

4 Experiments and Evaluations 

The evaluations were performed on three datasets: Harvard, Meridian
and HP-S3. Among them, Harvard contains dynamic RTT measurements collected
from a network of 226 nodes in 4 hours, Meridian is a static RTT matrix
of 2500 x 2500 and HP-S3 is a static ABW matrix of 231 x 231. More details
about these datasets can be found in [7]. We adopted the common evaluation
criterion used for recommender systems, the Root Mean Square Error (RMSE),

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} (x_k - \hat{x}_k)^2},$$

where $x_k$ and $\hat{x}_k$ are the true and predicted ratings of the $k$-th of
the $N$ evaluated paths.
Note that the smaller RMSE is, the better. 
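
For reference, the RMSE criterion can be computed as in the following sketch; the function name `rmse` and the toy ratings are illustrative only.

```python
import numpy as np

def rmse(x_true, x_pred):
    """Root mean square error between true and predicted ratings."""
    x_true = np.asarray(x_true, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    return float(np.sqrt(np.mean((x_true - x_pred) ** 2)))

# Two of four ratings are off by one, so RMSE = sqrt(2 / 4), about 0.707.
value = rmse([5, 3, 1, 4], [4, 3, 2, 4])
```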
4.1 Obtaining Ratings

We first obtain ratings from the three datasets. To this end, we partition the
range of the metric by the rating thresholds $\tau = \{\tau_1, \ldots, \tau_4\}$.
$\tau$ is set by two strategies: 1. set $\tau$ by the [20%, 40%, 60%, 80%]
percentiles of each dataset; 2. partition evenly the range between 0 and a large
value selected for each dataset.

Thus, for Strategy 1, $\tau$ = [48.8, 92.2, 177.2, 280.3] ms for Harvard,
$\tau$ = [31.6, 47.3, 68.6, 97.9] ms for Meridian, and $\tau$ = [12.7, 34.5,
48.8, 77.9] Mbps for HP-S3. For Strategy 2, $\tau$ = [75, 150, 225, 300] ms for
Harvard, $\tau$ = [25, 50, 75, 100] ms for Meridian, and $\tau$ = [20, 40, 60,
80] Mbps for HP-S3. Strategy 2 produces quite unbalanced proportions of ratings
on each dataset.
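
The two threshold strategies can be sketched as follows. The RTT sample is synthetic (log-normal) and only illustrates why evenly spaced thresholds give unbalanced ratings on skewed data; the function names are assumptions, and `upper` is interpreted here as the largest threshold, which matches the Strategy 2 values listed above.

```python
import numpy as np

def percentile_thresholds(values, R=5):
    """Strategy 1: thresholds at the 100/R, 200/R, ... percentiles of the data."""
    return np.percentile(values, 100 * np.arange(1, R) / R)

def even_thresholds(upper, R=5):
    """Strategy 2: R - 1 evenly spaced thresholds, the largest being `upper`."""
    return np.linspace(0, upper, R)[1:]

def rating_shares(values, thresholds):
    """Fraction of values falling into each of the R bins."""
    bins = np.digitize(values, thresholds)
    return np.bincount(bins, minlength=len(thresholds) + 1) / len(values)

rng = np.random.default_rng(0)
rtt = rng.lognormal(mean=4.0, sigma=0.8, size=10_000)   # synthetic, skewed RTTs in ms

shares_1 = rating_shares(rtt, percentile_thresholds(rtt))   # close to 20% per bin
shares_2 = rating_shares(rtt, even_thresholds(300))         # skewed toward the low bins
```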

4.2 Comparison of Different MF Models 

We solved RMF, MMMF and NMF by SGD. The learning rate of SGD, $\eta$, equals
0.05, the regularization coefficient $\lambda$ equals 0.1, and the rank r is 10
for all datasets. The neighbor number k is 10 for Harvard and HP-S3 and 32 for
Meridian. We do not fine-tune the parameters for each dataset and for each
model, as this is impossible under decentralized processing. Empirically, MF is
not very sensitive to the parameters, as the inputs are ordinal numbers in
[1, 5] regardless of the actual metric and values. For MF ensembles, we generate
for each MF model several predictors using different parameters [5]. Although
maintaining multiple predictors in parallel is impractical, MF ensembles produce
the (nearly) optimal accuracy that could be achieved based on MF in a
centralized manner.

Tables 1 and 2 show the RMSE achieved using different MF models and
different $\tau$-setting strategies. In particular, we made the following
observations. First, RMF generally performs better than MMMF and NMF, and MF
ensembles perform the best. Second, the improvement of MF ensembles over RMF is
only marginal, which is not considered worth the extra cost. Third, the accuracy
on Harvard is the worst, probably because the dynamic measurements in
Harvard were obtained passively, i.e., there was no control over when and which
neighbors a node probed. Last, it is clear that different settings of $\tau$
have some impact on the accuracy, which needs to be studied further.
Nevertheless, we adopt Strategy 1 by default in the sequel.

Table 1. RMSE with $\tau$ set by strategy 1
Table 2. RMSE with $\tau$ set by strategy 2

We would like to mention that for the Netflix dataset, the RMSE achieved by
Netflix's own algorithm, Cinematch, is 0.9525 and that by BPC is 0.8567 [9].
This shows that in practice, predictions with an RMSE below 1 are already usable
by applications. Thus, by trading off between accuracy and practicability, the
RMF model is adopted by default in our system. Table 3 shows the confusion
matrices achieved by RMF on the three datasets. In these matrices, each column
represents the predicted ratings, while each row represents the actual ratings.
Thus, the off-diagonal entries represent "confusions" between two ratings. It
can be seen that while there are mis-ratings, few have an error of
$|x_{ij} - \hat{x}_{ij}| > 1$, which means that the mis-ratings are under
control.
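
A confusion matrix with the layout described above (rows for actual ratings, columns for predicted ratings) can be built as in this sketch. The ratings are toy values, the matrix holds raw counts, and any normalization that may be used in Table 3 is not reproduced here.

```python
import numpy as np

def confusion_matrix(actual, predicted, R=5):
    """C[a-1, p-1] = number of paths with actual rating a that are predicted as p."""
    C = np.zeros((R, R), dtype=int)
    for a, p in zip(actual, predicted):
        C[a - 1, p - 1] += 1
    return C

actual = np.array([1, 2, 3, 4, 5, 5, 3, 2])
predicted = np.array([1, 3, 3, 4, 4, 5, 1, 2])
C = confusion_matrix(actual, predicted)

# Share of paths whose predicted rating is off by more than one level.
large_err = np.mean(np.abs(actual - predicted) > 1)
```

Entries on the diagonal are correct predictions, entries next to the diagonal are errors of one level, and `large_err` summarizes how rarely the prediction is off by more than one level.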

4.3 Peer Selection 

We demonstrate how peer selection can benefit from network performance
prediction, based on ratings of [1, 5] in this paper, on binary classes of
"good" and "bad" in [7] and on metric values in [6]. To this end, we let each
node randomly select a set of peers from all connected nodes. Each node then
chooses a peer from its peer set, and the optimality of the peer selection is
calculated by the stretch [12], defined as

$$s_i = \frac{x_{i\bullet}}{x_{i\circ}},$$

where $\bullet$ is the id of the selected peer, $\circ$ is that of the true
best-performing peer in node $i$'s peer set and $x_{i\cdot}$ is the measured
value of some metric. $s_i$ is larger than 1 for RTT and smaller than 1 for ABW.
The closer $s_i$ is to 1, the better.
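
The stretch of a single node can be computed as in the sketch below, assuming the peer is selected by the highest predicted rating and that ties go to the first candidate; the tie-breaking rule, the function name `stretch` and the toy values are assumptions.

```python
import numpy as np

def stretch(x_row, rating_row, peer_set, higher_is_better):
    """Stretch of node i: measured value to the selected peer over that to the best peer."""
    peer_set = np.asarray(peer_set)
    # Select the peer with the highest predicted rating (ties go to the first one).
    selected = peer_set[np.argmax(rating_row[peer_set])]
    # True best peer: largest measured ABW, or smallest measured RTT.
    measured = x_row[peer_set]
    best = peer_set[np.argmax(measured) if higher_is_better else np.argmin(measured)]
    return x_row[selected] / x_row[best]

rtt_ms = np.array([0.0, 20.0, 35.0, 120.0, 60.0])   # measured RTTs from node 0
ratings = np.array([0, 4, 5, 1, 3])                 # hypothetical predicted ratings
s = stretch(rtt_ms, ratings, peer_set=[1, 2, 4], higher_is_better=False)
```

In this example the prediction prefers peer 2 (35 ms) over the true best peer 1 (20 ms), giving a stretch of 35 / 20 = 1.75.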

Figure 3 shows the stretch of peer selection achieved based on value-based
prediction, class-based prediction and our rating-based prediction. Random peer
selection is used as a baseline for comparison. It can be seen that, in terms of
optimality, value-based prediction performs the best and rating-based prediction
performs better than class-based prediction. This shows that the rating
information is a good compromise between metric values and binary classes. On
the one hand, ratings are more informative than binary classes and allow finding
better-performing paths. On the other hand, ratings are qualitative and thus
incur lower measurement costs than metric values.

Table 3. Confusion matrices for Harvard (left), Meridian (middle) and HP-S3 (right).

Fig. 3. Peer selection by varying the number of peers in the peer set of each node.
Recall that the stretch is larger than 1 for RTT and smaller than 1 for ABW. The
closer it is to 1, the better.

5 Conclusions

This paper addresses the scalable acquisition of end-to-end network performance
by network inference based on performance ratings. We investigated different
matrix factorization models used in recommender systems, particularly the
solution that won the Netflix prize. By taking into account the accuracy and the
practicality, the simple regularized matrix factorization was adopted in the
inference system. Experiments on peer selection demonstrate the benefit of
network inference based on ratings to Internet applications.